## Usage

First, set your Hugging Face access token as an environment variable:

```
export HF_TOKEN='your HF token'
```

Then prepare the environment following the guidance from [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory). We used version 0.9.1.dev0. Refer to [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) for the meaning of the settings in the YAML files, including how to change the training dataset and the model. You can compare your installed environment against "icrm_requirement.txt".
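
To compare the installed environment with "icrm_requirement.txt", a small script along these lines may help. It is a minimal sketch that assumes the file uses plain `name==version` pins, one per line; any other requirement syntax is skipped, and the helper names are illustrative rather than part of this repository.

```python
from importlib import metadata


def parse_pins(lines):
    """Return {package: version} for lines of the form name==version."""
    pins = {}
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue  # assumption: only exact pins are checked
        name, version = line.split("==", 1)
        name = name.strip().lower()
        if name:
            pins[name] = version.strip()
    return pins


def installed_version(name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Return {package: (wanted, found)} for every pin that does not match."""
    mismatches = {}
    for name, wanted in pins.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


def report(path):
    """Print one line per package whose installed version differs from the pin."""
    with open(path, encoding="utf-8") as handle:
        pins = parse_pins(handle)
    for name, (wanted, found) in sorted(find_mismatches(pins).items()):
        print(f"{name}: required {wanted}, installed {found or 'missing'}")
```

Calling `report("icrm_requirement.txt")` from the project root prints only the packages that differ, so an empty output suggests the pinned packages match.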



### Training reward models
Run the scripts in "./data/ulfeedback" to prepare 400K samples from the Unified-Feedback dataset, and the script in "./data/skywork" to prepare the Skywork dataset. Run them from the project root:
```
python ./data/ulfeedback/load_data.py
python ./data/ulfeedback/load_data_400k.py
python ./data/skywork/build_dataset.py
```
To obtain the data generated by Qwen, first download the data from [prm800k](https://github.com/openai/prm800k/tree/main/prm800k/data) into "./data/prm_800k_handler/data", then run the scripts in "data/qwen_answer_generator" from the project root. Alternatively, you can use the data we have already generated:

```
python ./data/qwen_answer_generator/generate.py
python ./data/qwen_answer_generator/generate2orm.py
```

Our training scripts can be found in examples/train_lora. Note that in practice we saved the Gemma model in a local gemma-2b-it folder to avoid network and environment issues. To reproduce the proposed method, first train the generator for the 2B model:

```
CUDA_VISIBLE_DEVICES=0 FORCE_TORCHRUN=1 llamafactory-cli train examples/train_lora/feedback_40k_train_sft_lora.yaml
```

Then train the proposed reward model on top of the 2B model. To train GRM instead, set reg_ratio to 0 in "examples/train_lora/feedback_400k_taorm_lora.yaml":

```
CUDA_VISIBLE_DEVICES=0,1 FORCE_TORCHRUN=1 llamafactory-cli train examples/train_lora/feedback_400k_taorm_lora.yaml
```
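
Rather than editing the YAML file in place when switching between the proposed reward model and GRM, one option is to write a copy with `reg_ratio` overridden and pass that copy to `llamafactory-cli train`. The sketch below assumes `reg_ratio` appears as a top-level `key: value` line in the config; the function name is illustrative, and any trailing comment on the replaced line is dropped.

```python
import re


def set_top_level_key(yaml_text, key, value):
    """Replace (or append) a top-level `key: value` line in YAML text."""
    pattern = re.compile(rf"^{re.escape(key)}\s*:.*$", flags=re.MULTILINE)
    new_line = f"{key}: {value}"
    if pattern.search(yaml_text):
        # A function replacement avoids backslash handling in the new text.
        return pattern.sub(lambda _match: new_line, yaml_text, count=1)
    # Key not present: append it as a new top-level entry.
    separator = "" if yaml_text.endswith("\n") or not yaml_text else "\n"
    return f"{yaml_text}{separator}{new_line}\n"
```

Reading the original config, calling `set_top_level_key(text, "reg_ratio", 0)`, and writing the result to a new file keeps the original settings untouched for later runs.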

After training, export the merged model to obtain the proposed reward model:

```
CUDA_VISIBLE_DEVICES=0 FORCE_TORCHRUN=1 llamafactory-cli export examples/merge_lora/llama3_lora_sft.yaml
```
To obtain GRM after training it, change export_dir to "models/400k" in "examples/merge_lora/llama3_lora_sft.yaml".

For the Qwen-1.5B model, reward training and model merging are done with:
```
CUDA_VISIBLE_DEVICES=0 FORCE_TORCHRUN=1 llamafactory-cli train examples/train_lora/math_qwen_lora.yaml
CUDA_VISIBLE_DEVICES=0 FORCE_TORCHRUN=1 llamafactory-cli export examples/merge_lora/qwen2vl_lora_sft.yaml
```

### Evaluating reward models

To evaluate on RewardBench, first follow the [original guidance](https://github.com/allenai/reward-bench) to prepare the environment. Then replace the original reward_bench code with the code in our reward_bench folder. Put the model into "./ATORM/lora_reward_sft_reg" and evaluate it with:

```
CUDA_VISIBLE_DEVICES=0 python scripts/run_rm.py --model=./ATORM/lora_reward_sft_reg --batch_size=8 --tokenizer=./ATORM/lora_reward_sft_reg --do_not_save
```
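
For orientation, the core of a RewardBench-style score is a pairwise comparison: a pair counts as correct when the reward model scores the chosen response above the rejected one. The following is a minimal sketch of that metric, not the reward_bench implementation; the function name is illustrative, and ties are counted as incorrect here by assumption.

```python
def pairwise_accuracy(chosen_rewards, rejected_rewards):
    """Fraction of pairs where the chosen response gets the higher reward."""
    if len(chosen_rewards) != len(rejected_rewards):
        raise ValueError("reward lists must have the same length")
    if not chosen_rewards:
        raise ValueError("need at least one pair")
    # Strict comparison: a tie does not count as a win (an assumption).
    wins = sum(c > r for c, r in zip(chosen_rewards, rejected_rewards))
    return wins / len(chosen_rewards)
```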

Best-of-N (BON) evaluation in the general scenario has two steps: preparing the data and running the BON evaluation.

Data preparation:

```
CUDA_VISIBLE_DEVICES=0 python vllm_generate_data.py --model google/gemma-2b-it --max-tokens 1536 --temperature 1.0 --top-p 1.0 --sampling-num 16 --prompt_file eval_data/reward_bench_500_content.json --sampling_range '[0,500]' --record_file eval_data/reward_bench_500_content0_500_generation.json
```
To obtain responses generated by Llama3-8B-Instruct, change the model in the command above to "meta-llama/Meta-Llama-3-8B-Instruct" and change record_file so that the responses are saved to a new file.

Run BON selection and evaluation with the following command. Both GRM and the proposed reward model must be trained first. To get BON and pass@N results for a different record_file or a different N, change file_path and N in "BON_analysis.py":

```
CUDA_VISIBLE_DEVICES=0 python BON_analysis.py
```

BON verification on math tasks also has two steps: preparing the data and running the BON evaluation. The data come from [implicit-prm](https://github.com/PRIME-RL/ImplicitPRM/tree/main/eval/testset). Unzip the files and put them in "./eval_data". Then run the following commands to obtain the results of the proposed reward model on responses generated by Llama (train the Qwen reward model first):
```
python math_tools.py
CUDA_VISIBLE_DEVICES=0 python math_BON_analysis.py
```
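
As a point of reference, Best-of-N selection and pass@N can be sketched as follows. This is a generic illustration, not the code in "BON_analysis.py" or "math_BON_analysis.py": it assumes every prompt has N sampled responses, each with a scalar reward and a correctness label, and all function names are hypothetical.

```python
def best_of_n(rewards):
    """Return the index of the highest-reward response (first one on ties)."""
    return max(range(len(rewards)), key=lambda i: rewards[i])


def bon_accuracy(samples, n):
    """Fraction of prompts whose top-reward response among the first n is correct.

    `samples` is a list of prompts; each prompt is a list of
    (reward, is_correct) tuples, one per sampled response.
    """
    hits = 0
    for responses in samples:
        subset = responses[:n]
        chosen = best_of_n([reward for reward, _ in subset])
        hits += bool(subset[chosen][1])
    return hits / len(samples)


def pass_at_n(samples, n):
    """Fraction of prompts with at least one correct response among the first n."""
    solved = sum(any(ok for _, ok in responses[:n]) for responses in samples)
    return solved / len(samples)
```

Under these assumptions, pass@N is an upper bound on BON accuracy for the same N, since the reward model can at best pick a correct response whenever one exists.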

RLHF has three steps: preparing the data, training the DPO policy, and evaluating the DPO policy. The training and test sets are already split; the remaining data preparation is shown below. Before running "vllm_generate_data.py", set up the environment following the [vllm project](https://github.com/vllm-project/vllm):

```
CUDA_VISIBLE_DEVICES=0 python vllm_generate_data.py --max-tokens 1536 --temperature 1.0 --top-p 1.0 --sampling-num 8 --prompt_file eval_data/reward_bench_train_content.json --sampling_range '[0,6000]' --record_file eval_data/reward_bench_train_content0_6000_generation.json
CUDA_VISIBLE_DEVICES=0 python select_pair.py
```
To prepare the data for GRM, change goal_name to "./data/train/ulfeedback/dpo_grm_400k_reward_bench.json" and goal_name_filter to "./data/train/ulfeedback/dpo_grm_400k_reward_bench_filter.json" in "select_pair.py".
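
One common way to turn sampled responses into DPO training pairs is to score each response with the reward model, keep the highest-scoring one as `chosen` and the lowest-scoring one as `rejected`, and drop prompts whose reward gap is too small. The sketch below illustrates that general recipe only; it is an assumption, not a description of "select_pair.py", and the function name, the `min_gap` threshold, and the output keys are hypothetical.

```python
def build_preference_pair(prompt, responses, rewards, min_gap=0.0):
    """Return a DPO-style record, or None if the reward gap is not above min_gap."""
    if len(responses) != len(rewards) or len(responses) < 2:
        raise ValueError("need at least two responses, each with one reward")
    best = max(range(len(rewards)), key=lambda i: rewards[i])
    worst = min(range(len(rewards)), key=lambda i: rewards[i])
    if rewards[best] - rewards[worst] <= min_gap:
        return None  # responses are too close to give a useful signal
    return {
        "prompt": prompt,
        "chosen": responses[best],
        "rejected": responses[worst],
    }
```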

Train DPO policy:

```
CUDA_VISIBLE_DEVICES=0 FORCE_TORCHRUN=1 llamafactory-cli train examples/train_lora/reward_dpo_training.yaml
```
To train the DPO policy induced by GRM, change the dataset to "dpo_grm400K" in "examples/train_lora/reward_dpo_training.yaml".

Acquire DPO policy with:
```
CUDA_VISIBLE_DEVICES=0 FORCE_TORCHRUN=1 llamafactory-cli export examples/merge_lora/lora_dpo.yaml
```
To obtain the DPO policy induced by GRM, change export_dir to "models/dpo_grm400k" in "examples/merge_lora/lora_dpo.yaml".

Evaluate the DPO policies (train both the policy induced by GRM and the policy induced by our proposed reward model first):

```
CUDA_VISIBLE_DEVICES=0 python evaluate_policy.py
```

Reward analysis:
```
CUDA_VISIBLE_DEVICES=0 python reward_analysis.py
```

Length analysis (note that you need to install the package "seaborn" first):
```
CUDA_VISIBLE_DEVICES=0 python length_analysis.py
```